Use less CPU and RAM #549

daladim · 2023-02-07T22:59:46Z

Refs #537 .

This PR:

uses img2pdf instead of imagemagick to create PDFs. This saves much CPU and RAM usage, because this merely "staples together" various JPG files, whereas imagemagick would load every JPG into memory, do some kind of transcoding then writes
them to a PDF file.
when image transcoding/compression is needed, it does this file by file, rather than "all files at once". This avoids imagemagick trying to load all of them to memory before it converts them.

This makes some pipelines slightly faster to process, but this improves RAM usage a lot for most of them when multiple pages are scanned. On older Raspberry Pis, this fixes the "pipeline has stalled because RAM is full" issue.

Disclaimers:

I'm not a JS developer, maybe there's a more idiomatic way to add a final newline in the first commit
I have tested with my scanner and my Raspberry Pi. In case you have more scanners and more devices to test, feel free to do so :)

This only refs #537 and does not fix it, because there's still the "transcode every page when the next page is being scanned, instead of transcoding all of them at the end" improvement that could be done one day.

Thanks

This helps pipeline scripts to parse them

img2pdf only "staples together" various JPG files, whereas imagemagick loads every JPG into memory, does some kind of transcoding _then_ writes them to a PDF file. This change improves CPU and RAM usage when scanning to PDF files. Refs sbs20#537

sbs20 · 2023-02-15T22:36:11Z

To save me digging through and testing, why does there have to be an additional new line added at the end? Is it the while read or xargs?

daladim · 2023-02-15T23:44:27Z

I'm not sure about xargs. But read only splits strings at newlines, so it ignores the final line that does not finish with a LF.

There are tricks to also handle the final line in a loop, such as while read line || [ -n "$line" ]; do ...; done, but well, it feels that's better just to add a final LF :)

sbs20 · 2023-02-16T00:34:26Z

That link was fascinating, thank you. Let me ponder it further.

sbs20 · 2023-02-28T11:23:37Z

First - thank you so much for taking the time to do this and raise such a well documented PR.

Second, sorry it's taken until now to reply.

I have given this a LOT of thought. And I have decided not to merge the PR. The main problem for me is the additional dependencies of img2pdf. There are users who really don't want additional dependencies and even go so far as to edit the Dockerfile to reduce the image size. It should not stop you from running your code - although you will need to adapt your config.local.js to set your pipelines.

What I have done is raise #569 which will make it so your pipelines will work easily.

What I am also going to do is see about merging in the while read filename... code - so I am not yet closing this PR as I want to cherrypick.

daladim · 2023-02-28T14:01:46Z

Thanks for your message. img2pdf indeed is quite a large dependency, especially as it pulls the whole Python runtime. I understand that you are wary including it.

As you said, it does not prevent me from using it, I'll just have to include my own pipelines in config.local.js.
I am using Docker though. I had tried to make an extended image using a Dockerfile that reads FROM sbs20/scanservjs:... ; RUN apt-get update && apt-get install -y img2pdf. However, when running docker build, apt-get fails for some reason (I cannot reproduce as I cannot access my raspberry pi right now). Something about missing permissions to perform whatever late installation step. Does this ring a bell to you? Do you know how I could still install img2pdf in an extended Docker image, without building the whole image from scratch?
Not upstreaming these changes is too bad after all, because it makes scanservjs not work out-of-the-box for raspberry pi users that have a scanner with ADF. That's too bad because scanservjs is, by far, the best (and only?) solution I found to scan on a headless raspberry pi.
Img2pdf documentation, in its last paragraph, suggests tesseract is able to create PDFs without re-encoding. That would be great because it already is in the Docker image, but I could not make it work without having it insist on doing any OCR. Do you think it is possible though? You're probably more familiar with tesseract as I am.
Otherwise, do you think you would be OK to either
- publish two images (say, the "regular" scanservjs:$version image, plus an image tagged scanservjs:$version-with-extras)?
- Even without img2pdf, there still is the "file-by-file conversion" from this PR that does not require any extra dependency, that seems useful for lower-end machines (and that, I think, does not cause any harm on more up-to-date machines).
- At least write instructions in documentation files for raspberry pi users?

Out of curiosity, what machine are you using scanservjs on?
Thanks for your feedback!

sbs20 · 2023-02-28T20:59:43Z

Something about missing permissions to perform whatever late installation step. Does this ring a bell to you?

It all depends on what the ... is in FROM sbs20/scanservjs:.... If you're using this then it switches user and you won't have sufficient privs. You should be using this instead. If it's not that, then I don't know I'm afraid.

But I could not make it work without having it insist on doing any OCR. Do you think it is possible though?

No idea I'm afraid - I am actually fairly unfamiliar with Tesseract, sorry!

publish two images (say, the "regular" scanservjs:$version image, plus an image tagged scanservjs:$version-with-extras)

I would gladly link to someone else's image. I am less keen to add and maintain yet another image. I had never intended to support Docker - it adds a lot of support overhead, but somehow did anyway; this project is already placing heavy demands on my time.

there still is the "file-by-file conversion" from this PR that does not require any extra dependency

Yep - and I may yet include that, I just couldn't find the time to get it done and fully tested today. That's what I meant above with What I am also going to do is see about merging in the while read filename... code. I wanted to make the change in #569 which would at least mean you could use standard configuration changes.

At least write instructions in documentation files for raspberry pi users

Indeed. I do always try to do this - but I would welcome contributions.

Related to #537 This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.

Follow on to correct #549

daladim added 3 commits February 7, 2023 17:00

Collator: add a final newline in the list of files passed to pipelines

394d067

This helps pipeline scripts to parse them

Added img2pdf dependency

0c43819

Perfs: using img2pdf in PDF pipelines

42dae0b

img2pdf only "staples together" various JPG files, whereas imagemagick loads every JPG into memory, does some kind of transcoding _then_ writes them to a PDF file. This change improves CPU and RAM usage when scanning to PDF files. Refs sbs20#537

daladim force-pushed the less_cpu_and_ram branch from 9cbd4da to 42dae0b Compare February 7, 2023 23:04

sbs20 mentioned this pull request Feb 28, 2023

Add newline to pipeline stdin for POSIX compliance #568

Closed

sbs20 changed the base branch from master to iss537 March 11, 2023 17:14

Merge branch 'iss537' into less_cpu_and_ram

9c1e606

sbs20 merged commit 26dce7b into sbs20:iss537 Mar 11, 2023

sbs20 added a commit that referenced this pull request Mar 11, 2023

convert: use less RAM

1c9193a

Related to #537 This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.

sbs20 added a commit that referenced this pull request Mar 12, 2023

convert: use less RAM

7383750

Related to #537 This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.

sbs20 added a commit that referenced this pull request May 7, 2023

convert: use less RAM

a99588c

Related to #537 This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.

sbs20 added a commit that referenced this pull request May 7, 2023

convert: use less RAM

53baf59

Related to #537 This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.

sbs20 added a commit that referenced this pull request Sep 27, 2023

convert: use less RAM

72301a0

Related to #537 This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.

sbs20 mentioned this pull request Sep 27, 2023

convert: use less RAM #646

Merged

sbs20 added a commit that referenced this pull request Sep 27, 2023

convert: use less RAM

8d2c7fe

Related to #537 This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.

sbs20 added a commit that referenced this pull request Oct 13, 2023

Fix pipelines

4514313

Follow on to correct #549

sbs20 added a commit that referenced this pull request Oct 16, 2023

Fix pipelines

b777ae5

Follow on to correct #549

sbs20 added a commit that referenced this pull request Oct 16, 2023

Fix pipelines

ac576cf

Follow on to correct #549

sbs20 mentioned this pull request Oct 16, 2023

Fix pipelines #657

Merged

sbs20 added a commit that referenced this pull request Oct 16, 2023

Fix pipelines

bf2de32

Follow on to correct #549

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use less CPU and RAM #549

Use less CPU and RAM #549

daladim commented Feb 7, 2023 •

edited

Loading

sbs20 commented Feb 15, 2023

daladim commented Feb 15, 2023

sbs20 commented Feb 16, 2023

sbs20 commented Feb 28, 2023

daladim commented Feb 28, 2023 •

edited

Loading

sbs20 commented Feb 28, 2023

Use less CPU and RAM #549

Use less CPU and RAM #549

Conversation

daladim commented Feb 7, 2023 • edited Loading

sbs20 commented Feb 15, 2023

daladim commented Feb 15, 2023

sbs20 commented Feb 16, 2023

sbs20 commented Feb 28, 2023

daladim commented Feb 28, 2023 • edited Loading

sbs20 commented Feb 28, 2023

daladim commented Feb 7, 2023 •

edited

Loading

daladim commented Feb 28, 2023 •

edited

Loading